Analyzing and Modeling the Performance of the 
HemeLB Lattice- Boltzmann Simulation Environment 



Derek Groen 1 , James Hetherington 1 , Hywel B. Carver 1 , Rupert W. Nash 1 , 
Miguel O. Bernabeu 1 , Peter V. Coveney 1 

Centre for Computational Science, University College London, London, United Kingdom, 

e-mail: d.groen@ucl.ac.uk 



Abstract 

We investigate the performance of the HemeLB lattice-Boltzmann simula- 
tor for cerebrovascular bloodflow, intended to provide timely and clinically 
relevant assistance to neurosurgeons. HemeLB is optimized for sparse ge- 
ometries and supports interactive use, and scales well to 32,768 cores for 
problems with ~81 million lattice sites. We obtain a maximum performance 
of 29.5 billion site updates per second, and show only an 11% slowdown for 
highly sparse problems (5% fluid fraction). We present steering and visual- 
ization performance measurements and provide a model which allows users 
to predict the performance, thereby determining how to run simulations with 
maximum accuracy within time constraints. 

Keywords: lattice-Boltzmann, parallel computing, high-performance 
computing, performance modeling 



1. Introduction 

Recent progress in imaging and computing technologies has resulted in 
important advances in physiology. Using modern imaging methods, we are 
now able to scan the geometry of individual vessels within patients and map 
out potential sites for vascular malformations such as intracranial aneurysms. 
Likewise, recent increases in computational capacity and algorithmic im- 
provements in simulation environments allow us to simulate blood flow in 
great detail. The HemeLB lattice-Boltzmann application pQ aims to com- 
bine these two developments, thereby allowing medical scans to be used as 
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input for blood flow simulations. It also enables clinicians to run such simu- 
lations in real-time, providing runtime visualization feedback as well as the 
ability to steer the simulation and its visualization |2]. One principal goal 
for HemeLB is to act as a production toolkit that provides both timely and 
clinically relevant assistance to surgeons. To achieve this we must not only 
perform extensive validation and testing for accuracy, reliability, usability 
and performance, but also ensure that the legal environment and the medi- 
cal and computational infrastructure are made ready for such use cases [3]. 

In this work we investigate the performance aspects of the HemeLB en- 
vironment, taking into account the core lattice-Boltzmann (LB) simulation 
code and the visualization and steering facilities. We present performance 
measurements from a large number of runs using both sparse and non-sparse 
geometries and the overheads introduced by visualization and steering. Med- 
ical doctors treating patients with intracranial aneurysms are frequently con- 
fronted with very short time scales for decision-making. For HemeLB to be 
useful in such environments, it is therefore not only essential that the code 
simulates close to real-time, but also that the length of a simulation can be 
reliably predicted in advance. We demonstrate that it is possible to accu- 
rately characterise CPU and network performance at low core counts and 
integrate this information into a model that predicts performance for arbi- 
trary problem sizes and core counts. 

1.1. Overview of HemeLB 

HemeLB is a massively parallel lattice-Boltzmann simulation framework 
that allows interactive use in a medical environment. Segmented angio- 
graphic data from patients can be read in by the HemeLB Setup Tool, which 
allows the user to indicate the geometric domain to be simulated using a 
graphical user interface. The geometry is then discretized into a regular 
grid, which is used to run HemeLB simulations. 

The core HemeLB code, written in C++, consists of a parallelized lattice- 
Boltzmann application which is optimized for sparse geometries such as vas- 
cular networks. It employs the parallelized ParMETIS graph-partitioning 
library to construct a load-balanced domain decomposition at runtime, al- 
lowing the user to run simulations at varying core counts with the same 
simulation domain data. HemeLB is highly scalable due to a well optimized 
communication strategy and the locality of interactions and communications 
in the parallelized lattice-Boltzmann algorithm. 
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The HemeLB Steering Client is a light-weight tool that allows users to 
connect remotely to their HemeLB simulation, receive real-time visual feed- 
back and modify parameters of the simulation at runtime. In our work we 
run HemeLB with the steering server code enabled. As a result, one core is 
reserved for steering purposes, whether or not a client is connected, and is 
thereby excluded from the LB calculations. 

HemeLB uses a coalesced asynchronous communication strategy to op- 
timize its scalability. This system bundles all communications for each it- 
eration (e.g., exchanges required for the LB algorithm, steering and visu- 
alizations) into a single combined communication operation. As a result, 
each iteration of HemeLB 's core loop has only one MPI_Wait synchronisation 
point, minimizing the latency overhead of HemeLB simulations. We imple- 
mented this coalesced communication system by abstracting communications 
within a class capable of accumulating pointers into send and receive buffers, 
composing these pointers into a new temporary MPI datatype, and perform- 
ing the asynchronous sends, receives, and waits. Communication of variable 
length data is spread over two iterations, the sizes being transferred during 
the first iteration while the actual exchange takes place during the second 
one. 

The coalesced communication system is also used for the phased broadcast 
and reduce operations which are required for the visualization and steering 
functionality. Here HemeLB arranges the processes into an n-tree and, for 
broadcasts, sends data from one level of the tree to the level below over 
successive iterations. For reductions, data is sent up one level of the tree over 
successive iterations. Hence, both operations can take 0(log(p)) iterations, 
for p cores. In this approach is HemeLB does require some additional memory 
for communication buffers. Additionally, the responsiveness of the steering is 
constrained, as data arriving in the top- most node takes 0(log(p) iterations 
to be spread to all nodes. 

1.2. Related work 

A large number of researchers have investigated the performance aspects 
of various LB simulation codes over the past decade. These investigations 
have been done using non-sparse geometries and without real-time visualiza- 
tion or steering enabled, whereas we present a performance analysis of both 
sparse geometries and interactive usage modes in this work. Pohl et al. jl] 
compared the performance of LB codes across three supercomputer architec- 
tures, and concluded that the network and memory performance (bandwidth 
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and latency) are dominant components in establishing a high LB calcula- 
tion performance. Geller et al. [5] compared the performance of an LB code 
with that of several finite element and finite volume solvers, and deduced 
that LB offers superior efficiency in flow problems with small Mach numbers. 
Several groups have considered the performance of LB solvers on general- 
purpose graphics processing unit (GPGPU) architectures. In these studies, 
they introduced a number of improvements, such as non-uniform grids [6], 
more efficient memory management strategies [8] and LB codes which run 
across multiple GPUs [9j. Other performance investigations include a com- 
parison between different LB implementations [TU], hybrid parallelizations 
for multi-core architectures in general [TTJ E] and performance analysis of LB 
codes on Cell processors p21 EEl E] . 

A few studies within the medical domain are of special relevance to this 
work. These include a performance analysis of a blood-flow LB solver using 
a range of sparse and non-sparse geometries (15] and a performance predic- 
tion model for lattice-Boltzmann solvers [16J. This performance prediction 
model can be applied largely to our HemeLB application, although HemeLB 
uses a different decomposition technique and performs real-time rendering 
and visualization tasks during the LB simulations. Mazzeo and Coveney [1] 
studied the scalability of an earlier version of HemeLB. However, the current 
performance characteristics of HemeLB are substantially enhanced due to 
numerous subsequent advances in the code. 

2. Performance analysis 

We benchmarked HemeLB using simulation domains based on three dis- 
tinct geometries, a vascular network (see Fig. |2j used to generate three sim- 
ulation domains), a bifurcation of vessels (see Fig. [TJ used to generate two 
simulation domains) and a cylinder. Both the network and the bifurcation 
geometries are sections of an intracranial vasculature model that has been 
constructed from multiple rotational angiography scans of a patient with an 
intracranial aneurysm treated at the U.K. National Hospital for Neurology 
and Neurosurgery. The third and least sparse geometry is an artificially 
created cylinder. We present an overview of the simulation domains we gen- 
erated and use in our runs in Table [TJ We also provide a brief description of 
the sparseness of each generated simulation domain. 
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Figure 1: Graphical overview of the bifurcation geometry in the HemeLB Setup Tool. We 
used this geometry to generate the Bifurcation and Large Bifurcation simulation domains. 
Inlets are shown by green planes, outlets by red planes. 

Table 1: Overview of the simulation domains used in our experiments. The percentage of 
the simulated box that consists of active fluid sites is given by the fluid fraction. Non-active 
fluid sites d o not count towards the number of lattice sites in the simulation. 



Name 


# of lattice sites 


fluid fraction 


Bifurcation 


19,808,107 


11% 


Cylinder 


15,607,040 


65% 


Network 


18,836,545 


5.1% 


Large Bifurcation 


81,132,544 


11% 


Large Network 


44,650,496 


5.1% 


Small Network 


77,182 


5.1% 



2.1. Performance of LB computations 

We have run blood flow simulations using the simulation domains listed 
in Table [T] using up to 32,768 cores on the HECToR Phase 3 supercom- 
puter at EPCC in Edinburgh, United Kingdom. The HECToR machine is a 
Cray XE6, contains 2.3GHz AMD Opteron 6276 processors, and has a peak 
performance of 9.2 GFLOP/s per core. Our simulations were done using 
a 15-directional lattice-Boltzmann kernel (D3Q15), the Lattice Bhatnagar- 
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Figure 2: Graphical overview of the network geometry in the HemeLB Setup Tool. We 
used this geometry to generate the Network, Large Network and Small Network simulation 
domains. 

Gross-Krook [17] model with simple bounce-back boundary conditions and a 
fixed physical viscosity of 0.004 Pas. We present the scalability results for all 
simulation domains in Figure |3j We find that the small network simulation 
domain scales near-linearly up to 128 cores, despite consisting of only 77,182 
lattice sites. All of the medium-sized simulation domains (Bifurcation, Cylin- 
der and Network) scale linearly to 8,192 cores. However, the communication 
overhead and load imbalance reduce the performance on higher core counts. 
The two largest simulation domains (Large Bifurcation and Large Network) 
show superlinear scaling to 16,384 cores, and significant speedup to 32,768 
cores, achieving a maximum performance of 29.5 billion site updates per sec- 
ond (SUPS). The performance obtained at 8,196 cores for the medium-sized 
bifurcation corresponds to 419 timesteps per second, or 646 times slower 
than real-time for a maximum timestep as limited by incompressibility con- 
straints. The maximum timestep here is estimated by the need to keep the 
Mach number below 0.05, using a typical blood velocity for vessels of this 
size of 25 cm/s. At this rate, it takes HemeLB 553 seconds to simulate one 
heartbeat with a resolution of around 100 lattice points across a vessel di- 
ameter. We present the performance in SUPS per core as a function of the 
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Figure 3: Lattice site updates per second (SUPS) as a function of the number of cores 
used for simulations run on the HECToR Cray XE6 machine. We run simulations using 
each of the six simulation domains (Cylinder, Network, Bifurcation, Large Bifurcation, 
Large Network and Small Network. 

number of sites per core in Figure |4j demonstrating that the SUPS per core 
is largely independent of other factors. 

2.2. Visualization performance 

One of the features that sets HemeLB apart from many other LB codes 
is its ability to perform in situ rendering of the geometry at runtime [2], 
using a parallelized ray-tracing algorithm. The communication needs of the 
ray-tracing algorithm have been combined with those of the main simulation 
algorithm, through the coalesced communication strategy, massively improv- 
ing the scaling when rendering frames. The images rendered by HemeLB can 
either be stored on disk for future reference or they can be forwarded as a 
streaming visualization to the steering client. In this section we present 
several experiments where we assess the overhead introduced by rendering 
images, as well as that introduced by writing snapshots of the simulation 
data. These snapshots store the hydrodynamic variables at each lattice point, 
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Figure 4: Site updates per second (SUPS) per core averaged over all cores used in the 
simulation (excluding the one used for steering) as a function of the number of sites per 
core, for six LB problems. 



recording all information of physical relevance which is useful for visualiza- 
tion and post-processing. We have run four types of simulations using the 
Bifurcation simulation domain, one with snapshots and image-rendering dis- 
abled, one where we write snapshots to disk (10 snapshots per 1000 time 
steps), one where we render and write images to disk (10 rendered images 
per 1000 time steps) and one with both snapshots and images enabled. We 
have carried out the tests using 256, 512, 1024 and 2048 cores. We present 
our results in Fig. [5| Here the overhead for rendering and writing images 
is marginal, and adds no more than a few percent to the execution time in 
most cases. Simulations which have snapshot writing enabled are both con- 
siderably slower and have more irregular performance characteristics. When 
the simulation writes 10 snapshots per 1000 LB steps, we observe an increase 
in the wall-clock time of ~24 seconds. 

We have also run several simulations of 1000 LB steps where we render 
and write an image to disk every 5 to 200 LB steps. The results for these runs 
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Figure 5: Time spent to simulate 100 time steps as a function of the number of cores 
used for four settings: no snapshots and no images (blue), images only (green), snapshots 
only (black) and snapshots and images (red) . We averaged the results from runs including 
any form of snapshot or image writing over three executions, and included a standard 
deviation error bar with each data point. 

(which were done using 1024 and 2048 cores) are given in Fig. |6} Without 
rendering the simulations took 31.4, 16.1 and 7.81 seconds on 512, 1024 and 
2048 cores respectively. We observe an overhead of less than two seconds per 
1000 LB steps if we render and write no more than 10 images during that 
period. However, the performance deteriorates somewhat when we write 
more images, with a maximum measured overhead of ~6.5 seconds. We 
also again observe some jitter in our results, for example in the 1024 core 
simulation that rendered one image every 50 steps, which we attribute to 
fluctuations in the file system performance of the machine. Rendering one 
image per 5 LB steps using 2048 cores corresponds to a frame rate of about 
13.6 frames per second, more than sufficient for smooth visualizations of the 
simulations in real time. 
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Figure 6: Overhead in seconds relative to the simulation time without images rendered 
as a function of the number of LB steps per image rendered and written. The simulation 
with images rendered took 31.4, 16.1 and 7.81 seconds on respectively 512, 1024 and 
2048 cores. We averaged the measurement of the runs over three executions. 



2. 3. Steering performance 

The previous subsection isolates the performance impact of the visu- 
alization and rendering, with images written to disk. Here we study the 
performance impact of the HemeLB steering component, where images are 
streamed over the network to a client. In this case, HemeLB produces images 
as described in Section 2.2 optionally limited by a maximum frame-rate per 
second. We also look at the performance impact of sending steering messages 
from the client to the HemeLB steering component. In order to obtain re- 
producible data, the steering client is set up with a scripted set of simulated 
user actions (orbiting the view point for image rendering). These results are 
presented in Table [2] and in Figure [7] and were produced with the steering 
client running on the HECToR login node. For a frame-rate of 4.6 frames per 
second, which is usable for scientific steering, with bidirectional communica- 
tion between client and server, corresponding to 32 LB steps per rendered 
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Table 2: Performance impact of running HemeLB with a connected steering client using 
1024 and 2048 cores. Here the mode is the method of running HemeLB, which can be 
without client (none), with the client used only for image streaming (images) or with the 
client used both for image streaming and steering the HemeLB simulation (both). 



V 


mode 


frame-rate (1/s) 


A /TOT TOO 


mean LB steps 






requested 


achieved 


per core 


per image 


1024 


none 






1.39 




1 D9/1 


DOTjII 


2.0 


2.0 


1 98 




1024 


images 


2.0 


2.1 


1.25 


39.3 


1024 


both 


5.0 


4.4 


1.11 


16.5 


1024 


images 


5.0 


4.8 


1.02 


13.8 


1024 


both 


max 


5.9 


0.84 


11.3 


1024 


images 


max 


8.2 


0.76 


6.0 


2048 


none 






1.46 




2048 


images 


2.0 


2.1 


1.26 


77.2 


2048 


both 


2.0 


2.2 


1.32 


78.6 


2048 


both 


5.0 


4.6 


1.15 


32.2 


2048 


images 


5.0 


4.8 


0.99 


26.9 


2048 


images 


max 


9.5 


0.59 


8.0 


2048 


both 


max 


10.6 


0.66 


8.1 



image, we observe an overhead of 28%. 

2.4- Performance comparison with other codes 

In this section we compare the performance of HemeLB with performance 
measurements of other LB codes as found in the literature. We gathered the 
number of million lattice site updates per second (MSUPS), the standard 
measure of LB performance, reported for other implementations. HemeLB is 
strongly optimized for efficiently handling sparse geometries while most codes 
are not, making like-for-like comparison difficult. The other applications may 
not be capable of simulating even moderate complexity domains, such as a 
cylinder, at all or only at the cost of allocating memory to non-fluid sites. 
Additionally, the directional resolution affects the number of calculations and 
memory accesses required per site update, as well as the presence of other 
special features (e.g., most notably in LB3D [IS]). 

We found no MSUPS measurements in the literature for any benchmarks 
that use sparse geometries, so all results presented here from the other codes 
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Figure 7: Performance impact of running HemeLB with a connected steering client. We 
show results for 1024 and 2048 cores without steering client (plotted at frame-rate zero), 
with the client used only for image streaming (images) and with the client used both for 
image streaming and steering the HemeLB simulation (both). 

are for bulk flow only. We provide the LB performance configurations and 
results for several well-known LB codes in Tables |3] and |U The MSUPS 
per core results here are obtained by dividing the total number of lattice 
site updates by the product of time spent on LB iterations and the number 
of cores. From each literature source, we picked the result from the run 
that showed the best MSUPS per core while running on at least one full 
processor. In the case of HemeLB we picked the best result from the non- 
sparse Cylinder, as well as from the very sparse Network and Large Network 
simulation domains, which are the only measurements in the tables using 
sparse geometries. 

When we examine bulk flow only, the MSUPS per core performance of 
HemeLB is comparable with that achieved with LBMHD (although LBMHD 
calculates in 27 directions and HemeLB in 15), and about half of that 
achieved with Palabos on similar AMD Opteron architectures. The perfor- 
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Table 3: Technical specifications of 12 LB simulations in our code comparison. We provide 
the name of the LB application used in the first column (including the source), followed 
by respectively the architecture used for the simulations and the number of cores used for 
the rum 



Name 


Architecture 


cores 




(peak GFLOPS/core) 




HemeLB (Cylinder) 


AMD Opteron 6276 (9.2) 


4096 


HemeLB (Network) 


AMD Opteron 6276 (9.2) 


32 


HemeLB (Large Network) 


AMD Opteron 6276 (9.2) 


512 


LBMHD p3] 


AMD Opteron 2356 (9.2) 


8 


Palabos [20] 


AMD Opteron 8356 (9.2) 


4 


HYP04D (Groen p.c.) 


BlueGene/P (3.40) 


512 


LB3D (Schmieschek p.c.) 


BlueGene/P (3.40) 


256 


LBMHD P3] 


Intel Xeon E5345 (9.33) 


8 


LUDWIG [21] 


BlueGene/L (2.8) 


32 


MUPHY [22] 


BlueGene/L (2.8) 


32 


Palabos [20] 


BlueGene/P (3.40) 


256 


Palabos [20] 


Intel Xeon X5550 (10.64) 


4 



mance of HemeLB, however, is almost entirely preserved when using a very 
sparse simulation domain as HemeLB does not allocate memory or compu- 
tational effort for non-active lattice sites, which are by definition common 
in sparse geometries. LBMHD has no known optimizations for sparse ge- 
ometries while Palabos features a partial optimization using the multi-block 
method[l9], of which we found no performance data using sparse geometries 
in the literature. The multi-block method is relatively inefficient because it 
allocates memory to some of the non-fluid sites and uses data structures that 
grow in complexity when off-lattice geometries are modelled more accurately. 
When a code is not designed for sparse geometries, additional optimizations 
(e.g., cache lookahead) are simpler to implement, hence the performance of 
a code which supports sparse geometries may not match that of codes which 
exploit such optimizations. Many of the benchmarks for other LB codes were 
performed on non-Opteron architectures, making it difficult if not impossi- 
ble to do a one-on-one comparison. We nevertheless include these results for 
reference in the lower part of Table [3j 
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Table 4: Performance comparison of 12 LB simulations in our code comparison. We 
provide the name of the LB application used in the first column, followed by the number 
of lattice sites for each run, the directionality, and the obtained performance per core. We 
give the per core calculation performance in millions of site updates per second (MSUPS). 
In the case of LBMHD we assumed 1300 FLOPs per lattice operation, as mentioned in 
Williams et al. [13] . Runs that use a sparse simulation domain are marked with an asterisk. 
Runs that only use shared memory strategies only (no MPI) are marked with two asterisks. 



Name 


# of lattice sites 


directional 


MSUPS 






resolution 


per core 


HemeLB (Cylinder) 


15,607,040 


D3Q15 


1.41 


HemeLB* (Network) 


18,836,545 


D3Q15 


1.20 


HemeLB* (Large Network) 


44,650,496 


D3Q15 


1.19 


LBMHD** 


2,097,152 


D3Q27 


1.36 


Palabos** 


64,481,201 


D3Q19 


2.55 


HYP04D 


452,984,832 


D3Q19 


0.273 


LB3D 


2,147,483,648 


D3Q19 


0.172 


LBMHD 


2,097,152 


D3Q27 


0.538 


LUDWIG 


16,777,214 


D3Q19 


0.087 


MUPHY 


262,144 


D3Q19 


0.529 


Palabos 


1,003,003,001 


D3Q19 


0.891 


Palabos 


64,481,201 


D3Q19 


7.87 



Table 5: List of constant values used in our performance model. The A value was measured 
using a ping test between nodes on HECToR. 



Constant name 


Value 


T 


1.57 x 10 6 SUPS per core (calc only) 


A 


2.5 x 10- 5 [s] 


a 


160 MB/s per core 


Ccalc 


1.04 


Ccomm 


1.5 


^monitoring 


0.06 



3. Modeling the performance of HemeLB 

3.1. Parameter extraction 

Before we are able to construct and apply the performance model, we 
need to extract a number of parameters specific to HemeLB. 
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Figure 8: Number of bytes sent per LB simulation step as a function of the number of 
lattice sites per core for Cylindrical geometries (measurements have been done using the 
Cylinder simulation domain). The fits we use for Cylindrical, Bifurcation and Network 
geometries in our performance model are given respectively by the red, blue and black 
dashed lines. Error bars show one standard deviation for the distribution across cores. 



3.1.1. Characterizing communication volume 

To model the communication performance of HemeLB we need informa- 
tion on the amount of data communicated between processes at each step. 
As the domain decomposition in HemeLB is done at runtime [I], we can 
only know the exact communication data volume after we have launched 
the simulation. To preserve the predictive power of the performance model, 
we have instead measured the communication volume for the three types of 
simulation domains across a range of core counts. After having performed 
the measurements, we fitted the data to a function of the form ax b to gain 
an approximate estimate while keeping the model as simple as possible. We 
present our measurements of the communication volume and our fits for the 
cylindrical geometries in Figure [8j for the bifurcation geometries in Figure [9j 
and for the network geometries in Figure 10 We provide the exact formula- 
tion of the three fits in Table |6] 
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Figure 9: As in Fig. [8] but for the Bifurcation geometry (using the Bifurcation simulation 
domain) . 



3.1.2. Characterizing load imbalances 

When using sparse geometries, individual processes within HemeLB con- 
tain subsets of the simulated system with heterogeneous shapes and sizes. 
These differences result in two types of load imbalance during the parallel 
LB calculation: a calculation load imbalance and a communication load im- 
balance. To obtain a platform-independent measure of the load imbalance in 
HemeLB, we choose not to include timing results in this procedure. Instead, 
we examine the number of lattice sites on each core to determine the calcu- 
lation imbalance and the number of bytes sent by each process to determine 
the communication imbalance. Both metrics are reproducible on different 
platforms when using the same version of ParMETIS (4.0.2). 

In Fig. 11 we show the measured calculation load imbalance for three 
geometries as a function of the core count. We determine this calculation 
load imbalance by dividing the maximum number of lattice sites on any core 
within this run by the average number of sites over all cores in the same run. 
HemeLB is optimized for calculation load balance and we find an imbalance of 
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Figure 10: As in Fig. [8] but for the Network geometry (using the Network simulation 
domain) . 



less than 1.04 for most core counts. However, the calculation load imbalance 
is higher for both very low and very high core counts. This load imbalance 
contributes in part to the superlinear scaling of HemeLB at lower core counts 
in some cases, and reduces scalability when there are less than 2000 lattice 
sites per core. Based on these measurements, we assume that Ccaic = 1-04 in 
our performance model. 



In Fig. 12 we present the communication load imbalance, which we mea- 
sure by dividing the maximum number of bytes sent by a single core in 
the run by the average number of bytes sent per core. All the communi- 
cation measurements are given per step, averaged over at least 100 simula- 
tion steps. We observe a large and erratic imbalance in the communication 
sizes. The ParMETIS domain distribution algorithm co-optimizes for both 
calculation load balance and communication minimization. However, these 
results suggest that it does not optimize for communication balance. This 
communication imbalance does not strongly diminish the code performance 
unless the performance is already dominated by communication. Within our 
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Table 6: List of fitting functions used in our performance model. Here the total number 
of lattice sites is given by TV and the number of cores used by p. 
Constant name Value 

^cylinder 1898 x ( jV/p) - 482719 bytes per core per step 

^bifurcation 942.0 x ( jy/p) - 595517 bytes per core per step 

^network 1176 x ( jV/p) - 613449 bytes per core per step 




Figure 11: Imbalance of the number of sites per core (i.e., our measure for calculation load 
imbalance) as a function of the number of cores for the three geometries. The value on 
the y-axis is the relative calculation overhead caused by load imbalance. These values are 
deterministic for a given core count and ParMETIS version. 



model we take an approximate average of our measurements, and assume 
that ( comm — 1.5. 

3.2. LB calculations 

To model the performance of the core LB simulator code we propose a 
time-complexity model which is loosely based on [16] but largely simplified. 



We use a range of parameters which we derived in Section 3.1 In this model 
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Figure 12: Imbalance in the number of bytes sent (i.e., our measure for communication 
load imbalance) as a function of the number of cores for the three geometries. The value 
on the y-axis is the relative communication overhead caused by load imbalance. These 
values are deterministic for a given core count and ParMETIS version. 



we approximate the overall time spent to perform a single simulation step in 
HemeLB (T step ), using 

rp Ccalc X T ca \ c -\- ( mmm X T comm 

step = i n — n ' ^ ' 

l.U ^monitoring 

where T ca i c is the average calculation time per core, (Ccaic) is the calculation 
load imbalance constant, T comm is the communication time per core, Ccomm 
is the communication load imbalance constant and mon i tO ring is the fraction 
of time spent on monitoring overhead. Throughout our runs we found that 
~ 6% of the runtime is spent on monitoring, so we define O moni toring = 0.06. 
The average calculation time per core is given by 

^calc = (2) 

T 
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Here, the total number of lattice sites is given by iV and the number of cores 
by p. We define the SUPS per core r as a platform dependent constant for 
the HECToR machine in Table |5j We measured r as an average from our 
HemeLB runs with 32 cores (1 node). The true SUPS capacity per core 
depends slightly on the number of sites per core, but is in almost all cases 
within 20% of this average value. We model the time spent on communica- 
tions, T comm , using 

S 

T comm = log 2 (p) x A + (3) 

a 

where A is the point-to-point latency of MPI communications between nodes 
in seconds, and a the average throughput capacity per core in bytes. We 
assume that the number of messages exchanged per time step increases with 
the number of processes and we model this as log(p). The number of bytes 
sent out per core per step (S <x> ) is dependent on the geometry used as 
well as the number of sites per core. We have provided basic fits for three 
geometry layouts with different sparsity (network, bifurcation and cylinder) 
in Table HI These fits are most accurate for simulations that have between 
5,000 and 200,000 sites per core. 

3. 3. Visualization 

When image rendering and writing is enabled in HemeLB, some overhead 
is introduced in the execution, and the new time per step (T step vis ) becomes 



-^~step_vis ^step -|- 0; ma g e s, (4) 

where O images is the overhead for rendering and writing images. Because our 
overhead measurements show a large variability, we use a straightforward 
fit rather than a detailed sub-model to approximate this overhead. Based 
on our measurements on 2048 cores, we have derived an approximate fit of 
^images = 21.6/c 0-76 , with k being the number of LB steps per rendered image. 



We provide a graphical overview of the approximation in Figure 13 



4. Model validation 

We have applied our performance model to calculate the theoretical exe- 



cution times of the simulations we presented in Section 2.1 The predictions 
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Figure 13: Overhead in seconds relative to the simulation time without images rendered 
as a lunction of the number of LB steps per image rendered and written. Error bars are 
present in Figure [6j but omitted here for clarity. The prediction of the performance model 
is given by the thick solid red curve. 



given by the model, as well as the measurements presented earlier, can be 
found in Figure 14 for the Cylinder, Bifurcation and Large Bifurcation sim- 
ulation domains and in Figure 15 for the Network, Small Network and Large 
Network simulation domains. The predictions from our model are generally 
in agreement with our measurements, especially for the larger simulation 
domains. However, the model does not predict the superlinear speedup mea- 
sured in the results. This is mainly due to the relatively large calculation 
and communication load imbalances we find at low core counts. 



5. Discussion 

We have presented a range of performance measurements for HemeLB, 
covering the lattice Boltzmann simulation and the visualization and steer- 
ing functionalities. HemeLB scales near-linearly up to 32,768 cores, even for 
highly sparse simulation domains such as vascular networks. The application 
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Figure 14: Wall-clock time spent to simulate 100 time steps as a function of the number 
of cores used for the Cylinder, Bifurcation and Large Bifurcation simulation domains. 
Predictions by our performance model are indicated by the dashed lines. 

achieves close to maximum efficiency when using between 5,000 and 500,000 
lattice sites per core. We have shown that HemeLB can render and write im- 
ages once every 100 timesteps with an overhead of ~10% and share streaming 
images and control with a steering client at 4.6 fps with a 28% overhead. We 
have demonstrated that it is possible to create a model which can estimate 
the run time of HemeLB simulations in advance. In our validation tests, we 
find that the predictions are between 70% and 140% of the actual runtime for 
simulations with at least 5000 lattice sites per core. We believe that accurate 
runtime predictions will be useful when HemeLB is used in a clinical setting, 
as doctors will be able to select the simulation with the highest accuracy that 
still meets the deadline for actual treatment. 

To improve the accuracy of HemeLB simulations, as part of the MAP- 
PER project [23], we are developing an intercommunication layer that allows 
the code to exchange boundary information with other simulation codes. 
These couplings allow us to incorporate phenomena that are not resolved 
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Figure 15: As Fig. 14 but for the Network, Small Network and Large Network simulation 
domains. 



in HemeLB itself, such as the interaction between the blood flow in the in- 
tracranial vasculature and that in the rest of the human body. The boundary 
exchanges in these coupled simulations occur at high frequency and require 
rapid response times on both ends. The performance bottlenecks we have 
identified allow us to take the necessary steps to ensure an optimal perfor- 
mance for multiscale simulations using HemeLB. 

The envisaged use-case for HemeLB, involving deployment within a clini- 
cal setting is made more difficult by typical queueing and scheduling policies 
for supercomputers. One important benefit of supercomputing lies in en- 
abling results to be produced in a timely fashion. With typical scheduling 
policies, however, many codes produce results only after a lengthy wait in 
a queuing system, significantly reducing the value-added of the supercom- 
puting resource relative to a long-running simulation on a smaller machine. 
The value of supercomputing is particularly apparent when using interactive 
visualization and steering [2J, as this enables complex simulations to be in- 
vestigated on timescales close to those of human engagement. However, this 
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form of interaction is not possible without an advance reservation facility, 
enabling one to predict the time when one will be able to interact with the 
running simulation. 

In particular, in the clinical context, patients and physicians already in- 
teract within a complex resource availability and scheduling environment. 
In this case, advance reservation will be necessary to make computing re- 
sources available concurrently with medical equipment, physicians, and pa- 
tient needs. Furthermore, when HemeLB is used in a clinical context, rapid 
access to computing resources will become a safety-critical factor. This re- 
quires not just advance reservation, but support for urgent computing [3]. 
For the use cases we envisage for HemeLB, an urgent computing mechanism 
will need to be available on supercomputing resources. 
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